@michel-laterman michel-laterman commented Sep 12, 2025

What does this PR do?

Add the agent_policy_id and policy_revision_idx attributes to checkin requests.
These attributes are sourced from the action stored as part of the state.
Add a feature flag to disable sending acks for policy change actions. The default behaviour is unchanged by this addition: policy change acks are still always sent.
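
For illustration only, a minimal sketch of how the two attributes might be populated from a stored action. Only the JSON keys agent_policy_id and policy_revision_idx come from this description; the struct names, the status field, and the helper functions are hypothetical and are not the agent's actual types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// policyChangeAction is a hypothetical stand-in for the policy change action
// persisted in the agent's state; the field names are assumptions.
type policyChangeAction struct {
	PolicyID    string
	RevisionIdx int64
}

// checkinRequest is a hypothetical, trimmed-down checkin body. Only the two
// policy JSON keys are taken from the PR description.
type checkinRequest struct {
	Status            string `json:"status"`
	AgentPolicyID     string `json:"agent_policy_id,omitempty"`
	PolicyRevisionIdx int64  `json:"policy_revision_idx,omitempty"`
}

// newCheckinRequest copies the policy details from the last stored action.
// When no action is stored, both attributes are left empty and omitted.
func newCheckinRequest(status string, last *policyChangeAction) checkinRequest {
	req := checkinRequest{Status: status}
	if last != nil {
		req.AgentPolicyID = last.PolicyID
		req.PolicyRevisionIdx = last.RevisionIdx
	}
	return req
}

// encode marshals the request to JSON and returns "" on error.
func encode(req checkinRequest) string {
	b, err := json.Marshal(req)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	stored := &policyChangeAction{PolicyID: "policy-1", RevisionIdx: 3}
	fmt.Println(encode(newCheckinRequest("online", stored)))
	fmt.Println(encode(newCheckinRequest("online", nil)))
}
```

With a stored action the body carries both attributes; with no stored action they are omitted, which is one possible way to handle a freshly enrolled agent.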

Why is it important?

The policy information in fleet-server and agent may go out of sync; this may occur in cases where a VM restores from a snapshot.

Checklist

  • I have read and understood the pull request guidelines of this project.
  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in ./changelog/fragments using the changelog tool
  • I have added an integration test or an E2E test

Disruptive User Impact

N/A

Related issues

@michel-laterman michel-laterman added enhancement New feature or request Team:Elastic-Agent-Control-Plane Label for the Agent Control Plane team backport-skip labels Sep 12, 2025

michel-laterman commented Sep 12, 2025

Adding integration/e2e tests requires the fleet-server to be implemented first: elastic/fleet-server#5501
I'll keep this as a draft until the above PR is merged.

I'm not changing the agent's default behaviour with regard to acks.

Add the agent_policy_id and policy_revision_idx attributes to checkin
requests.
@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch 2 times, most recently from 0721e6f to cfb5df4 Compare September 12, 2025 22:25
@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch from cfb5df4 to 9535385 Compare September 18, 2025 20:24
@michel-laterman michel-laterman force-pushed the feat/checkin-policy-details branch from 85cfaef to 67a3b80 Compare September 22, 2025 17:24
@elasticmachine
Collaborator

💚 Build Succeeded

History

cc @michel-laterman

@michel-laterman michel-laterman marked this pull request as ready for review September 22, 2025 19:44
@michel-laterman michel-laterman requested a review from a team as a code owner September 22, 2025 19:44
@elasticmachine
Collaborator

Pinging @elastic/elastic-agent-control-plane (Team:Elastic-Agent-Control-Plane)

@blakerouse blakerouse left a comment

Overall this looks good, but I have 2 questions.

@blakerouse blakerouse left a comment

Looks good. Thanks for the clarification.

@michel-laterman
Contributor Author

    case <-f.scheduler.WaitTick():
        f.log.Debug("FleetGateway calling Checkin API")
        // Execute the checkin call and for any errors returned by the fleet-server API
        // the function will retry to communicate with fleet-server with an exponential delay and some
        // jitter to help better distribute the load from a fleet of agents.
        resp, err := f.doExecute(ctx, requestBackoff)
        if err != nil {
            continue
        }
        actions := make([]fleetapi.Action, len(resp.Actions))
        copy(actions, resp.Actions)
        if len(actions) > 0 {
            f.actionCh <- actions
        }

After fleet checkin, the agent sends all actions through a channel to the dispatcher. They are executed concurrently with the checkin loop; the ticker by default has a 1s duration with up to 500ms jitter.
There is no guarantee that the POLICY_CHANGE action is executed before the next checkin.
cc @blakerouse
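
To make the ordering concern concrete, here is a small self-contained sketch. It is not the agent's code: the agent and action types, the release channel, and simulate are all hypothetical, and the dispatcher is deliberately held back so the outcome is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// action is a hypothetical stand-in for a fleet action carrying a policy revision.
type action struct{ Revision int }

// agent models only what the discussion needs: the revision recorded in state
// and the channel that hands actions to the dispatcher.
type agent struct {
	mu       sync.Mutex
	revision int
	actionCh chan []action
}

// reportedRevision is what the next checkin request would carry.
func (a *agent) reportedRevision() int {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.revision
}

// dispatch receives one batch of actions but applies it only after release is
// closed, modelling a dispatcher that has not yet run the POLICY_CHANGE action.
func (a *agent) dispatch(release <-chan struct{}, done chan<- struct{}) {
	actions := <-a.actionCh
	<-release
	a.mu.Lock()
	for _, act := range actions {
		a.revision = act.Revision
	}
	a.mu.Unlock()
	close(done)
}

// simulate returns the revision reported by a checkin that fires right after
// the handoff, and the revision reported once the dispatcher has finished.
func simulate() (beforeApply, afterApply int) {
	a := &agent{revision: 1, actionCh: make(chan []action)}
	release := make(chan struct{})
	done := make(chan struct{})
	go a.dispatch(release, done)

	a.actionCh <- []action{{Revision: 2}} // checkin response contained a policy change
	beforeApply = a.reportedRevision()    // next tick fires before the action is applied
	close(release)
	<-done
	afterApply = a.reportedRevision()
	return beforeApply, afterApply
}

func main() {
	before, after := simulate()
	fmt.Println(before, after) // prints "1 2"
}
```

The handoff through the channel completes as soon as the dispatcher receives the batch, so a checkin issued at that moment still reports the previous revision.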

@blakerouse
Contributor

@michel-laterman Thanks for the clarification on the call today. I don't think this should be an issue with this PR, but we might want to make just the policy change blocking, at least until we know it's either applied or not applied. That would really reduce the load on Fleet Server and could be a real scale improvement.

We could do something like:

    ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
    defer cancel()
    waitForPolicyApply := f.handleActions(actions)
    select {
    case <-waitForPolicyApply:
    case <-ctx.Done():
    }
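
A runnable version of that idea might look like the sketch below. It assumes the dispatcher signals completion by closing a channel; waitForPolicyApply as a function, its signature, and the demo helper are hypothetical, and the timeouts are arbitrary.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForPolicyApply blocks until applied is closed or the timeout elapses.
// It returns true only when the policy change was reported as applied in time.
func waitForPolicyApply(ctx context.Context, applied <-chan struct{}, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	select {
	case <-applied:
		return true
	case <-ctx.Done():
		return false
	}
}

// demo exercises both outcomes with short, arbitrary timeouts.
func demo() (appliedInTime, timedOut bool) {
	finished := make(chan struct{})
	close(finished) // the dispatcher has already applied the policy
	appliedInTime = waitForPolicyApply(context.Background(), finished, time.Second)

	never := make(chan struct{}) // the dispatcher never reports back
	timedOut = !waitForPolicyApply(context.Background(), never, 10*time.Millisecond)
	return appliedInTime, timedOut
}

func main() {
	appliedInTime, timedOut := demo()
	fmt.Println(appliedInTime, timedOut) // prints "true true"
}
```

Scoping the timeout context to a helper also keeps the deferred cancel from accumulating inside the checkin loop.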

@michel-laterman michel-laterman merged commit f2c4cfa into elastic:main Sep 24, 2025
23 checks passed
@michel-laterman michel-laterman deleted the feat/checkin-policy-details branch September 24, 2025 15:17
@michel-laterman
Contributor Author

Created #10130 to track

@blakerouse
Contributor

@michel-laterman Thanks!

v1v added a commit that referenced this pull request Sep 26, 2025
* upstream: (505 commits)
  Update journald tests now that Filebeat supports watching folders (#10131)
  [deploy/kubernetes]: add info about hostPID for Universal Profiling (#10173)
  Fall back to process runtime if otel runtime is unsupported (#10087)
  Conditionall check for ms_tls13kdf build tag (#10160)
  [docs][edot] add entry for profiles (#10163)
  edot/docs: add support for profiles (#10146)
  Add Logstash exporter (#10137)
  Add back publish to serverless. (#10159)
  Improve Integration test documentation (#10155)
  Fix multiarch service image push from main to serverless (#10129)
  Forward migrate action to endpoint (#9801)
  Comment out check for ms_tls13kdf tag for FIPS-capable binaries (#10148)
  [otel] add receivers: apache, iis, mysql, postgresql, sqlserver v0.135.0 (#9344)
  Add k8sevents receiver in kube-stack (#10086)
  feat: emit system resource metrics for EDOT subprocess (#10003)
  [AutoOps] Configure OTel Exporter to Send Maximum-sized Batches (#10126)
  keep enrollment token when replacing data with signed (#10115)
  Revert "Publish `elastic-agent-service` container directly to serverless from main (#9583)" (#10127)
  Add agent_policy_id and policy_revision_idx to checkin requests (#9931)
  remove resource/k8s processor and use k8sattributes processor for service attributes (#10108)
  ...
Development

Successfully merging this pull request may close these issues.

Fleet check-in should send policy_id and revision
3 participants